
    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimation, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, and social networks. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph on the webspam dataset in less than 10 seconds by brute force (n^2 D) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
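    The abstract's central design point, retrieving candidates by counting hash-bucket collisions over reservoir-sampled buckets instead of computing similarities, can be sketched compactly. The following is a minimal illustration, not the paper's CPU/GPU implementation: the class name, parameters (K concatenated minhashes per table, L tables, fixed reservoir size), and the use of Python's built-in hash are all our own assumptions.

```python
import random
from collections import defaultdict

class ReservoirLSH:
    """Hypothetical sketch of count-based LSH retrieval with reservoir buckets.

    L hash tables, each keyed by K concatenated minhash values; buckets are
    fixed-size reservoirs, and queries rank candidates purely by collision
    counts, with no explicit similarity computations.
    """

    def __init__(self, K=4, L=32, reservoir_size=64, seed=0):
        self.K, self.L, self.R = K, L, reservoir_size
        rng = random.Random(seed)
        self.seeds = [[rng.getrandbits(32) for _ in range(K)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.counts = [defaultdict(int) for _ in range(L)]  # items seen per bucket

    def _key(self, item, t):
        # item: set of nonzero feature ids; one seeded minhash per table row
        # (a real system would use proper one-pass minwise hashes)
        return tuple(min(hash((s, f)) for f in item) for s in self.seeds[t])

    def insert(self, item_id, item):
        for t in range(self.L):
            key = self._key(item, t)
            bucket = self.tables[t][key]
            self.counts[t][key] += 1
            n = self.counts[t][key]
            if len(bucket) < self.R:
                bucket.append(item_id)
            else:  # reservoir sampling keeps an unbiased fixed-size sample
                j = random.randrange(n)
                if j < self.R:
                    bucket[j] = item_id

    def query(self, item, k=10):
        votes = defaultdict(int)
        for t in range(self.L):
            for cand in self.tables[t].get(self._key(item, t), ()):
                votes[cand] += 1
        return sorted(votes, key=votes.get, reverse=True)[:k]
```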

    Distribution of sizes of erased loops for loop-erased random walks

    We study the distribution of sizes of erased loops for loop-erased random walks on regular and fractal lattices. We show that for arbitrary graphs the probability P(l) of generating a loop of perimeter l is expressible in terms of the probability P_{st}(l) of forming a loop of perimeter l when a bond is added to a random spanning tree on the same graph, via the simple relation P(l) = P_{st}(l)/l. On d-dimensional hypercubical lattices, P(l) varies as l^{-\sigma} for large l, where \sigma = 1 + 2/z for 1 < d < 4 and z is the fractal dimension of the loop-erased walks on the graph. On recursively constructed fractals with \tilde{d} < 2 this relation is modified to \sigma = 1 + 2\bar{d}/(\tilde{d} z), where \bar{d} is the Hausdorff dimension and \tilde{d} the spectral dimension of the fractal.
    Comment: 4 pages, RevTeX, 3 figures
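    The distribution P(l) is easy to probe numerically: run a random walk and, whenever it revisits a site of its current loop-erased path, erase the loop just closed and record its perimeter. A minimal Monte Carlo sketch on Z^2, with illustrative parameters and function names of our choosing:

```python
import random
from collections import Counter

def erased_loop_sizes(n_steps=200_000, seed=1):
    """Record perimeters of loops erased by a loop-erased random walk on Z^2."""
    random.seed(seed)
    path = [(0, 0)]
    index = {(0, 0): 0}          # site -> position in the current path
    sizes = Counter()
    for _ in range(n_steps):
        x, y = path[-1]
        dx, dy = random.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
        site = (x + dx, y + dy)
        if site in index:        # the walk closed a loop: erase and record it
            i = index[site]
            sizes[len(path) - i] += 1
            for s in path[i + 1:]:
                del index[s]
            del path[i + 1:]
        else:
            index[site] = len(path)
            path.append(site)
    return sizes                 # histogram of l; expect P(l) ~ l^(-sigma)

# Example: inspect the small-l end of the empirical histogram
hist = erased_loop_sizes()
for l in sorted(hist)[:10]:
    print(l, hist[l])
```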

    Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing

    All pairs similarity search is a problem where a set of data objects is given and the task is to find all pairs of objects whose similarity is above a certain threshold for a given similarity measure of interest. When the number of points or the dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some extent and provide substantial speedup over traditional index-based approaches. BayesLSH is used for pruning the candidate space and computing approximate similarities, whereas BayesLSHLite can only prune candidates; similarity must be computed exactly on the original data. Thus, wherever the explicit data representation is available and exact similarity computation is not too expensive, BayesLSHLite can be used to aggressively prune candidates and provide substantial speedup without losing much quality. However, the loss in quality is higher in the BayesLSH variant, where no explicit data representation is available, only a hash sketch, and similarity has to be estimated approximately. In this work we revisit the LSH problem from a frequentist perspective and formulate sequential tests for the composite hypothesis (similarity greater than or less than a threshold) that such LSH algorithms can leverage for adaptive, aggressive candidate pruning. We propose a vanilla sequential probability ratio test (SPRT) approach based on this idea, along with two novel variants. We extend these variants to the case where approximate similarity must be computed, using a fixed-width sequential confidence-interval generation technique.
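    The vanilla SPRT the abstract mentions reduces to Wald's classical test on a Bernoulli stream: each minhash-signature component of two sets matches with probability equal to their Jaccard similarity. A minimal sketch, with threshold and error parameters (t0, t1, alpha, beta) chosen purely for illustration:

```python
import math

def sprt_prune(matches, t0=0.5, t1=0.7, alpha=0.05, beta=0.05):
    """Vanilla Wald SPRT on a stream of 0/1 hash-component agreements.

    Tests H0: similarity <= t0 against H1: similarity >= t1 by accumulating
    the Bernoulli log-likelihood ratio. Returns 'prune', 'keep', or
    'undecided' if the sketch runs out of components first.
    """
    lo = math.log(beta / (1 - alpha))      # accept-H0 boundary
    hi = math.log((1 - beta) / alpha)      # accept-H1 boundary
    llr = 0.0
    for m in matches:
        llr += math.log(t1 / t0) if m else math.log((1 - t1) / (1 - t0))
        if llr <= lo:
            return "prune"    # similarity judged below threshold
        if llr >= hi:
            return "keep"     # candidate survives for exact verification
    return "undecided"        # fall back to exact or approximate similarity
```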

    You can't see what you can't see: Experimental evidence for how much relevant information may be missed due to Google's Web search personalisation

    The influence of Web search personalisation on professional knowledge work is an understudied area. Here we investigate how public sector officials self-assess their dependency on the Google Web search engine, whether they are aware of the potential impact of algorithmic biases on their ability to retrieve all relevant information, and how much relevant information may actually be missed due to Web search personalisation. We find that the majority of participants in our experimental study are neither aware that there is a potential problem nor have a strategy to mitigate the risk of missing relevant information when performing online searches. Most significantly, we provide empirical evidence that up to 20% of relevant information may be missed due to Web search personalisation. This work has significant implications for Web research by public sector professionals, who should be provided with training about the potential algorithmic biases that may affect their judgments and decision making, as well as clear guidelines on how to minimise the risk of missing relevant information.
    Comment: paper submitted to the 11th Intl. Conf. on Social Informatics; revision corrects an error in the interpretation of the parameter Psi/p in RBO resulting from a discrepancy between the documentation of the R implementation (https://rdrr.io/bioc/gespeR/man/rbo.html) and the original definition (https://dl.acm.org/citation.cfm?id=1852106) as per 20/05/201
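    The revision note turns on rank-biased overlap (RBO), whose persistence parameter p controls how deep into two result rankings the comparison looks. A minimal sketch of the plain truncated form from the original definition (Webber et al., 2010), with an illustrative parameter value; this is not the gespeR implementation referenced above:

```python
def rbo(S, T, p=0.9):
    """Truncated rank-biased overlap of two ranked lists S and T.

    p is the persistence parameter: the expected number of ranks a user
    inspects is 1/(1-p), so p = 0.9 models a user who looks at about the
    top 10 results. Plain finite-prefix sum, not the extrapolated variant.
    """
    depth = min(len(S), len(T))
    seen_s, seen_t, score = set(), set(), 0.0
    for d in range(depth):
        seen_s.add(S[d])
        seen_t.add(T[d])
        overlap = len(seen_s & seen_t)       # agreement at depth d + 1
        score += (p ** d) * overlap / (d + 1)
    return (1 - p) * score

# Identical rankings give 1.0 at any truncation depth; disjoint ones give 0.
print(rbo(["a", "b", "c"], ["a", "b", "c"]), rbo(["a", "b"], ["x", "y"]))
```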

    Set Similarity Search for Skewed Data

    Set similarity join and the corresponding indexing problem, set similarity search, are fundamental primitives for managing noisy or uncertain data. For example, these primitives can be used in data cleaning to identify different representations of the same object. In many cases one can represent an object as a sparse 0-1 vector, or equivalently as the set of nonzero entries in such a vector. A set similarity join can then be used to identify those pairs that have an exceptionally large dot product (or intersection, when viewed as sets). We choose to focus on identifying vectors with large Pearson correlation, but the results extend to other similarity measures. In particular, we consider the indexing problem of identifying correlated vectors in a set S of vectors sampled from {0,1}^d. Given a query vector y and a parameter alpha in (0,1), we need to search for an alpha-correlated vector x in a data structure representing the vectors of S. This kind of similarity search has been intensely studied in worst-case (non-random data) settings. Existing theoretically well-founded methods for set similarity search are often inferior to heuristics that take advantage of skew in the data distribution, i.e., widely differing frequencies of 1s across the d dimensions. The main contribution of this paper is to analyze the set similarity problem under a random data model that reflects the kind of skewed data distributions seen in practice, allowing theoretical results much stronger than what is possible in worst-case settings. Our indexing data structure is a recursive, data-dependent partitioning of vectors inspired by recent advances in set similarity search. Previous data-dependent methods do not seem to allow us to exploit skew in item frequencies, so we believe that our work sheds further light on the power of data dependence.
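    For 0-1 vectors, Pearson correlation depends only on the two set sizes, the intersection size, and the dimension d, which is why an index for large intersections doubles as an index for alpha-correlated pairs. A small self-contained illustration (the function name is ours):

```python
import math

def pearson_01(A, B, d):
    """Pearson correlation of the indicator vectors of sets A, B in {0,1}^d.

    For binary vectors the correlation reduces to a function of |A|, |B|,
    |A & B| and d alone, so finding alpha-correlated pairs is equivalent
    to finding pairs with an exceptionally large intersection.
    """
    a, b, c = len(A), len(B), len(A & B)
    num = d * c - a * b
    den = math.sqrt(a * (d - a) * b * (d - b))
    return num / den if den else 0.0

# Example: two sparse, heavily overlapping sets in d = 1000 dimensions
print(pearson_01({1, 2, 3, 4}, {2, 3, 4, 5}, 1000))  # close to 1
```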

    Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.
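    The exact brute-force baseline that the approximate search is measured against is straightforward to state. A minimal numpy sketch under cosine similarity; the paper's actual similarity function is more elaborate, and the names and shapes here are illustrative:

```python
import numpy as np

def knn_candidates(queries, corpus, k=100):
    """Brute-force k-NN candidate generation under cosine similarity.

    Stand-in for the candidate-generation step: every corpus record is
    scored against every query (the slow exact baseline an approximate
    k-NN index avoids), and the top-k ids are returned for re-ranking.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T                                    # (n_queries, n_corpus)
    topk = np.argpartition(-sims, k, axis=1)[:, :k]   # unordered top-k ids
    order = np.take_along_axis(-sims, topk, axis=1).argsort(axis=1)
    return np.take_along_axis(topk, order, axis=1)    # sorted best-first

rng = np.random.default_rng(0)
print(knn_candidates(rng.normal(size=(2, 64)), rng.normal(size=(1000, 64)), k=5))
```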

    BagMinHash - Minwise Hashing Algorithm for Weighted Sets

    Minwise hashing has become a standard tool for calculating signatures that allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than the current state of the art without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represents the first efficient algorithm producing independent signature components. A series of tests verifies the new algorithm and also reveals limitations of other recently published approaches.
    Comment: 10 pages, KDD 201
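    For reference, the unweighted baseline the abstract starts from: k independent minwise hashes of a set, whose component-wise agreement rate is an unbiased estimator of Jaccard similarity. BagMinHash generalises this to weighted sets; the sketch below (seeded via Python's built-in hash, an assumption of ours) shows only the classic unweighted scheme:

```python
import random

def minhash_signature(s, k=128, seed=0):
    """k independent minwise hash components of a set s.

    Each component is the minimum of a seeded hash over the set's elements;
    two signatures agree in a component with probability equal to the
    Jaccard similarity of the underlying sets.
    """
    rng = random.Random(seed)
    seeds = [rng.getrandbits(32) for _ in range(k)]
    return [min(hash((z, x)) for x in s) for z in seeds]

def jaccard_estimate(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two sets with true Jaccard similarity 50/150 = 1/3
a, b = set(range(100)), set(range(50, 150))
print(jaccard_estimate(minhash_signature(a), minhash_signature(b)))  # ~0.33
```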

    Minimizing energy below the glass thresholds

    Focusing on the optimization version of the random K-satisfiability problem, the MAX-K-SAT problem, we study the performance of the finite-energy version of the Survey Propagation (SP) algorithm. We show that a simple (linear time) backtrack decimation strategy is sufficient to reach configurations well below the lower bound for the dynamic threshold energy and very close to the analytic prediction for the optimal ground states. A comparative numerical study against one of the most efficient local search procedures is also given.
    Comment: 12 pages, submitted to Phys. Rev. E, accepted for publication
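    The local search procedures used for comparison in this literature are typically WalkSAT-style: minimise the energy (number of violated clauses) by repeatedly picking a violated clause and flipping one of its variables, either at random or greedily. A hedged sketch of that baseline, not of SP itself; all parameter values are illustrative:

```python
import random

def walksat(clauses, n_vars, max_flips=100_000, noise=0.5, seed=0):
    """WalkSAT-style local search minimising MAX-K-SAT energy.

    clauses: lists of DIMACS-style nonzero literals (sign = polarity).
    Returns the lowest energy seen and the final assignment.
    """
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars + 1)]

    def sat(cl):
        return any(assign[abs(l)] == (l > 0) for l in cl)

    def energy():
        return sum(not sat(cl) for cl in clauses)

    best = energy()
    for _ in range(max_flips):
        violated = [cl for cl in clauses if not sat(cl)]
        if not violated:
            return 0, assign
        cl = rng.choice(violated)
        if rng.random() < noise:              # random walk move
            v = abs(rng.choice(cl))
        else:                                 # greedy move: least resulting energy
            def after_flip(lit):
                assign[abs(lit)] = not assign[abs(lit)]
                e = energy()
                assign[abs(lit)] = not assign[abs(lit)]
                return e
            v = abs(min(cl, key=after_flip))
        assign[v] = not assign[v]
        best = min(best, energy())
    return best, assign

# Random 3-SAT instance at clause density 4.2 (illustrative)
rng = random.Random(1)
n = 200
clauses = [[rng.choice([-1, 1]) * v for v in rng.sample(range(1, n + 1), 3)]
           for _ in range(int(4.2 * n))]
print(walksat(clauses, n)[0])  # lowest number of violated clauses found
```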

    Complexity transitions in global algorithms for sparse linear systems over finite fields

    We study the computational complexity of a very basic problem, namely that of finding solutions to a very large set of random linear equations over a finite Galois field of order q. Using tools from statistical mechanics we are able to identify phase transitions in the structure of the solution space and to connect them to changes in the performance of a global algorithm, namely Gaussian elimination. Crossing phase boundaries produces a dramatic increase in the memory and CPU requirements of the algorithm. In turn, this causes the saturation of the upper bounds for the running time. We illustrate the results on the specific problem of integer factorization, which is of central interest for deciphering messages encrypted with the RSA cryptosystem.
    Comment: 23 pages, 8 figures
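    The global algorithm in question is ordinary Gaussian elimination carried out with arithmetic modulo q. A minimal sketch for a square system over GF(q) with q prime (the dense representation below is exactly where the memory blow-up shows: initially sparse rows fill in as elimination proceeds):

```python
def solve_mod_q(A, b, q):
    """Gaussian elimination over GF(q), q prime, for a square system Ax = b.

    Returns a solution vector, or None if a pivot cannot be found
    (singular system, for the purposes of this sketch).
    """
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = next((r for r in range(col, n) if M[r][col] % q), None)
        if piv is None:
            return None
        M[col], M[piv] = M[piv], M[col]            # partial pivoting by swap
        inv = pow(M[col][col], -1, q)              # modular inverse, q prime
        M[col] = [x * inv % q for x in M[col]]
        for r in range(n):                         # eliminate column col
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(x - f * y) % q for x, y in zip(M[r], M[col])]
    return [M[r][-1] for r in range(n)]

# Example over GF(5): 2x + y = 1, x + 2y = 0  ->  x = 4, y = 3
print(solve_mod_q([[2, 1], [1, 2]], [1, 0], 5))
```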

    Minimum spanning trees on random networks

    We show that the geometry of minimum spanning trees (MST) on random graphs is universal. Owing to this geometric universality, we are able to characterise the energy of the MST using a scaling distribution P(\epsilon) obtained with uniform disorder. We show that the MST energy for other disorder distributions is simply related to P(\epsilon). We discuss the relationship to invasion percolation (IP), to the directed polymer in a random medium (DPRM), and the implications for the broader issue of universality in disordered systems.
    Comment: 4 pages, 3 figures
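    The geometric universality has an elementary algorithmic counterpart: the MST depends only on the rank order of edge weights, so any monotone increasing transformation of the disorder leaves the tree's geometry unchanged and alters only its energy. A minimal demonstration with Kruskal's algorithm (graph size and density are arbitrary choices of ours):

```python
import math
import random

def kruskal(n, edges):
    """Minimum spanning forest by Kruskal's algorithm with union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    tree = []
    for w, u, v in sorted(edges):           # only the weight ORDER matters
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Random graph with uniform disorder on the edges
random.seed(0)
n = 200
edges = [(random.random(), u, v) for u in range(n) for v in range(u + 1, n)
         if random.random() < 0.1]
mst_uniform = kruskal(n, edges)

# Apply a monotone increasing map to the disorder: the rank order of the
# weights, and hence the MST's geometry, is unchanged; only the energy changes.
mst_mapped = kruskal(n, [(math.exp(w), u, v) for w, u, v in edges])
print(sorted(mst_uniform) == sorted(mst_mapped))  # True
```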